Hello.
I want to work with a CSV file. It is larger than 20 GB, so I cannot read it in all at once. I have been trying to import it in chunks using rowrange(), filter what I need, save each chunk, and then append the chunks.
The issue is that at some point the import stops keeping the variable names and instead names the variables based on the first observation of that chunk, or something like that.
I have tried different chunk sizes and starting the loop at different points, and this keeps coming up at different points in the data. Any thoughts on why this happens?
(I would also be more than happy to hear about other possible approaches.)
In the code below, the chunk size (500,000) and the limit of the while loop (15 million) are arbitrary. I don't know how many observations I have (I believe around 65 million). As I said, I have tried different chunk sizes and loop limits. With these numbers, on the 15th iteration, when it pulls rows 7,000,001 to 7,500,000, the import runs "successfully" but loses the variable names, and when I then try to replace estado it naturally issues "variable estado not found r(111);". I tried independently importing rows 6,900,000 to 7,100,000, 7,100,000 to 7,200,000, and so on, and that works perfectly.
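A minimal sketch of how the failing chunk could be isolated outside the loop, assuming the same file.csv and the row range reported above. The second import adds varnames(1), an import delimited option that asks Stata to take variable names from row 1 of the file; whether that changes the behavior here is untested and only something that may be worth comparing.

```stata
* Sketch: import only the chunk that fails inside the loop
import delimited using "file.csv", rowrange(7000001:7500000) clear

* Check whether the expected variable exists after the import
capture confirm variable estado
if _rc {
    display as error "estado not found: variable names were lost in this chunk"
    describe    // shows the names Stata assigned instead
}

* Untested variant for comparison: request names from row 1 explicitly
import delimited using "file.csv", varnames(1) rowrange(7000001:7500000) clear
describe, short
```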
Here is my code:
Code:
local k = 1
local j = 500000
local h = 1
while `k' < 15000000 {
    import delimited using "file.csv", rowrange(`k':`j') clear
    replace estado = "MEXICO" if estado == "MÃXICO"
    replace estado = "CDMX" if estado == "DISTRITO FEDERAL"
    keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO") // This drops around 70% of the data, on average
    save file_chunk_`h'.dta, replace
    local k = `k' + 500000
    local j = `j' + 500000
    local h = `h' + 1
}
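The loop only saves the chunks; the final append step mentioned at the start is not shown. A minimal sketch of it, assuming the loop ran to completion and wrote file_chunk_1.dta through file_chunk_30.dta (15,000,000 / 500,000 = 30 chunks). The output name file_filtered.dta is hypothetical.

```stata
* Sketch: combine the saved chunks into a single dataset
local n = 30    // number of chunk files the loop above would produce

use file_chunk_1.dta, clear
forvalues i = 2/`n' {
    append using file_chunk_`i'.dta
}

* Hypothetical output file name
save file_filtered.dta, replace
```

If the row limit or chunk size changes, n would need to change with it.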
Any suggestions?
Thanks in advance.
Carlos Aburto Castellanos